Two reasons that we’ll be most interested in (to start):
It’s a great way to safely store organized versions of a project
It’s ideal for data science collaborators who need to share & edit code together
Watch this video before or after today’s session
A repository (repo) on GitHub is the same kind of unit as an RStudio Project: a place where you can store all the information, data and other files related to the project you’re working on.
When we create a repository on GitHub and connect it to a Project in RStudio, we can get (pull) information from GitHub into RStudio, or push information from RStudio to GitHub, where it is safely stored and our collaborators can access it. GitHub also keeps a complete history of updated versions that you and your collaborators can review at any time, from anywhere, as long as you have an internet connection.
A couple of general tips:
Pull frequently (if working with anyone else, and when you start working on a project again, even if on the same device)
Commit/push in small, meaningful increments and with useful (searchable, descriptive) Commit messages
The best way to deal with merge conflicts is to avoid creating them. Conflicts typically arise when two people pull the same version, change the same part of a file separately, and then both try to push back to the repo. Communicate with collaborators so you’re not all working on the same piece of code at the same time.
PART 1. Fork & Clone an existing repo
PART 2. Create your own repo and version controlled R project from scratch
PART 3. Use the GitHub Classroom to fork a project, organise folders, run a script
END. Make sure you complete this worksheet; it will be vital for submitting your second summative assignment.
Become a master of reproducible research & version control with GitHub
Improve project structure
Write informative notes for collaborators & archiving
a. Go to github.com and log in (you need your own account - sign up with your uea.ac.uk e-mail)
b. In the Search bar, look for repo Philip-Leftwich/5023Y-Happy-Git
c. Click on the repo name, and look at the existing repo structure
d. FORK the repo
e. Press Clone/download, copy the URL, and create a new project from a Git repository in RStudio (add your URL). Note: you may be asked to enable Git and/or to provide your GitHub username and password (for Git over HTTPS, GitHub now expects a personal access token in place of the account password)
f. Open the some_cool_animals.Rmd document, and the accompanying html
g. Add your name to the top of the document
h. BUT WAIT. We have forgotten a very important species - Baby Yoda. Add a great image and these facts:
Also known as “The Child”
likes unfertilised frog eggs & control knobs
strong with the force
i. Once you’ve added Grogu, knit the Rmd document to update the html
j. Stage, Commit & Push all files
Stage - pick the files you intend to bind to a commit
Commit - write a short descriptive message; this binds the staged changes to a single commit
Push - “Pushes” your changes from the local repo to the remote repo on GitHub.
k. On GitHub, refresh and see that files are updated. Cool! Now you’ve used something someone else has created, customized it, and saved your updated version.
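RStudio's Git pane runs the stage, commit and push steps for you. As a rough sketch of what happens underneath, the R code below calls the git command line through system2(). It assumes git is installed and on your PATH. The run_git() helper, the temporary folders, the demo user name and the file notes.txt are all made up for illustration, and a local bare repository stands in for the remote repo on GitHub.

```r
# Hypothetical helper (not from any package): run git inside `dir`, stop on failure.
run_git <- function(args, dir) {
  out <- suppressWarnings(
    system2("git", c("-C", shQuote(dir), args), stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")
  if (!is.null(status) && status != 0) stop(paste(out, collapse = "\n"))
  invisible(out)
}

# Throwaway folders: `remote` is a local stand-in for the repo on GitHub.
base   <- tempfile("git-demo-")
remote <- file.path(base, "remote.git")
local  <- file.path(base, "local")
dir.create(remote, recursive = TRUE)
dir.create(local)

run_git(c("init", "--bare", "--quiet"), remote)
run_git(c("init", "--quiet"), local)
run_git(c("remote", "add", "origin", shQuote(remote)), local)

# Change a file, then stage, commit and push it.
writeLines("Grogu: strong with the Force", file.path(local, "notes.txt"))
run_git(c("add", "notes.txt"), local)                      # Stage
run_git(c("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
          "commit", "--quiet", "-m", shQuote("Add Grogu facts")),
        local)                                             # Commit
run_git(c("push", "--quiet", "origin", "HEAD"), local)     # Push

# The commit message now appears in the remote's history.
remote_log <- run_git(c("log", "--all", "--oneline"), remote)
print(remote_log)
```

In day-to-day work you would not need any of this; the Git pane buttons do the same three steps against your real GitHub repo.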
Today we will have a bit of a play with tidyverse tools and plotting to refresh your memory!
Remember if your code doesn’t run, it is usually a simple fix. Read the error message carefully and see how many of these bingo tiles you can pick up!
a. Go back to your GitHub account
b. Click on the plus sign (upper right, by your profile pic/icon) to create a new repository
c. Name the repo 5023Y-second-repo-yourinitials (like 5023Y-second-repo-PL), and select to initialize with a ReadMe
d. Edit the ReadMe (however you want - notice that markdown formatting works & you can see a preview) & commit
Some tips:
Include a title, introduction/objectives
e. Clone to create a connected R Project in RStudio
f. Create a new R Markdown document
g. Attach the {tidyverse}, {palmerpenguins} and {plotly} packages in a hidden code chunk (include = FALSE)
h. Create a plot of PalmerPenguins, showing the output but not the code or messages
penguin_graph <- ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(size = body_mass_g,
                 color = species),
             alpha = 0.4) +
  scale_color_manual(values = c("purple", "orange", "black")) +
  theme_minimal() +
  labs(x = "bill length (mm)",
       y = "bill depth (mm)",
       title = "Penguin measurements")
penguin_graph
### Remember: to hide the code in your Rmd, include echo=FALSE in your code chunk header
### (add message=FALSE and warning=FALSE to hide messages and warnings as well)
### ```{r, echo=FALSE}
i. Make your plot interactive with the ggplotly function
j. Knit & save your .Rmd
k. Stage, commit & push back to GitHub
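For step i, one possible approach is sketched below. It assumes the {ggplot2}, {palmerpenguins} and {plotly} packages are installed, and it rebuilds a trimmed version of the penguin plot so that the example runs on its own. The name penguin_interactive is just an illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)
library(plotly)

# Static plot (a trimmed version of the worksheet's penguin_graph)
penguin_graph <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), alpha = 0.4) +
  theme_minimal()

# ggplotly() converts a ggplot object into an interactive plotly widget
penguin_interactive <- ggplotly(penguin_graph)
penguin_interactive
```

In a knitted html document the widget lets you hover over points, zoom and toggle species from the legend.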
BLURB ON CLASSROOM
a. Follow this invite link
b. You will be invited to sign in to GitHub (if not already) & to join the UEABIO organisation
c. Clone your assignment to work locally in RStudio
d. In your local project folder, create subfolders ‘data’ and ‘final_graphs’ (Note use the dir.create commands in the console)
e. Drop the file disease_burden.csv into the ‘data’ subfolder (Note: the simplest way is the RStudio Files pane; base R’s file.rename() can also move a file)
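Steps d and e can also be done from the console. The sketch below uses only base R: dir.create() makes the subfolders and file.rename() moves a file. So that it can be run anywhere, a temporary folder stands in for your project folder and a one-line placeholder file stands in for the real disease_burden.csv; both are assumptions for illustration only.

```r
# A temporary folder stands in for the project root in this sketch.
project <- tempfile("project-")
dir.create(project)

# Step d: create the two subfolders.
dir.create(file.path(project, "data"))
dir.create(file.path(project, "final_graphs"))

# Placeholder file standing in for the downloaded csv (NOT the real data).
downloaded <- file.path(project, "disease_burden.csv")
writeLines("placeholder", downloaded)

# Step e: move the file into 'data'. file.rename() returns TRUE on success.
target <- file.path(project, "data", "disease_burden.csv")
moved  <- file.rename(downloaded, target)

list.files(project, recursive = TRUE)
```

In your own project you would drop the file.path(project, ...) wrappers and use relative paths such as "data" from the project root.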
f. Open a new .R script
g. Attach the {tidyverse} and {janitor} packages
h. Read in the disease_burden.csv data
This file looks at the death rate for every country in the world across four decades.
i. Stage, commit & push at this point - notice that the empty folder ‘final_graphs’ doesn’t show up (Git won’t track an empty folder)
j. Back in the script, write a short script to read and clean the data.
Use %>% and filter to pull out “United States”, “Japan”, “Afghanistan” & “Somalia”
We want deaths for both sexes in the 0-6 days age group, so use filter here as well.
Assign this to a new object
library(tidyverse)
library(janitor)
db <- read_csv("data/disease_burden.csv") %>%
  clean_names() %>%
  rename(deaths_per_100k = death_rate_per_100_000)
# View(db)
# Subset (US, Japan = lowest infant death rates, Afghanistan = highest infant death rates)
db_sub <- db %>%
  filter(country_name %in% c("United States", "Japan", "Afghanistan", "Somalia")) %>%
  filter(age_group == "0-6 days", sex == "Both")
k. Make a ggplot plotting the deaths per 100K by country across the four decades
HINT - use geom_line() and remember to separate countries by colour
ggplot(data = db_sub) +
  geom_line(aes(x = year,
                y = deaths_per_100k,
                color = country_name)) +
  scale_color_manual(values = c("black", "blue", "magenta", "orange"))
l. Update your graph with direct labels (using annotate) and vertical or horizontal lines with geom_vline or geom_hline
# Graph
# New things: annotation + vertical line
ggplot(data = db_sub) +
  geom_line(aes(x = year,
                y = deaths_per_100k,
                color = country_name)) +
  scale_color_manual(values = c("black", "blue", "magenta", "orange")) +
  annotate(geom = "text",
           x = 1985,
           y = 2.2e5,
           label = "Afghanistan",
           size = 2.5) +
  geom_vline(xintercept = 2000,
             lty = 2) +
  theme_minimal()
m. Use ggsave() to write your graph to a .png in the ‘final_graphs’ subfolder
n. Save, stage, commit & push
o. Check that changes are stored on GitHub
(NOTE this will be in your organisations rather than repos)
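For step m, a minimal sketch of saving a plot with ggsave() is given below, assuming {ggplot2} is installed. The data frame toy, its values and the file name toy_plot.png are placeholders invented for illustration; in the worksheet you would save the plot built from db_sub instead.

```r
library(ggplot2)

# Placeholder data for illustration only (not from disease_burden.csv).
toy <- data.frame(year  = c(1990, 2000, 2010),
                  value = c(3, 2, 1))

toy_plot <- ggplot(toy, aes(x = year, y = value)) +
  geom_line() +
  theme_minimal()

# Create the folder if it does not exist yet, then write the plot into it.
dir.create("final_graphs", showWarnings = FALSE)
out_file <- file.path("final_graphs", "toy_plot.png")  # hypothetical file name
ggsave(filename = out_file, plot = toy_plot, width = 6, height = 4, dpi = 300)
```

Passing the plot object explicitly avoids relying on ggsave()'s default of saving whichever plot was displayed last. Once the .png exists, final_graphs is no longer empty, so it can be staged and committed.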
Make sure you finish all three exercises before next week to become a GitHub pro!!!!!!!!!
Want to learn about all things GitHub and R? See https://happygitwithr.com/